The white wine quality data set from Cortez et al. (2009) is explored. The wines are classified as Vinho Verde and are exclusively produced in the demarcated region of Vinho Verde in northwestern Portugal. These wines are described as possessing “vibrant freshness, elegance, lightness and aromatic and flavorful expressions.” The paper and data can be found here:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
The attributes included in the data set are described below (taken from “wineQualityInfo.txt”):
Based on the descriptions of Vinho Verde wines and the attributes in the data set, the features that are predicted to positively contribute to quality are:
The features that are predicted to negatively contribute to quality are:
The following analysis explores the attributes in a systematic manner. The features that mainly influence quality are then further investigated.
## 'data.frame': 4898 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ score : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality score
## Min. : 8.00 Min. :3.000 3: 20
## 1st Qu.: 9.50 1st Qu.:5.000 4: 163
## Median :10.40 Median :6.000 5:1457
## Mean :10.51 Mean :5.878 6:2198
## 3rd Qu.:11.40 3rd Qu.:6.000 7: 880
## Max. :14.20 Max. :9.000 8: 175
## 9: 5
There are 4898 white wines in the data set with 11 real features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, and “alcohol”). The “X” column is only a row index. The analysis of the data set is centered on how these features are related to the “quality” of a wine. An extra variable, “score”, is a factor version of the “quality” feature with its levels in increasing order. The rating scale runs from 0 to 10 (best), although only the levels 3 to 9 occur in the data.
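The code that built “score” is not shown, so the following is only a minimal sketch of one way such a column could be derived. It assumes base R's `factor()` (the `str()` output lists “score” as a plain `Factor`), and the small `wines` data frame is a made-up stand-in rather than the real data.

```r
# Hypothetical stand-in for the wine data: only the "quality" column matters here.
wines <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 6L, 8L, 4L))

# Convert the integer rating to a factor; the levels are sorted numerically.
wines$score <- factor(wines$quality)

str(wines)          # "score" shows up as a Factor with 7 levels
table(wines$score)  # number of wines per level
```

If the ordering of the levels needs to be enforced in comparisons or models, `factor(wines$quality, ordered = TRUE)` would be the usual alternative.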
Most wines are rated in the middle to mid-high range (5-6), with a median “quality” of 6. Most features appear to have large outliers. “density” and “pH” might be exceptions, perhaps because these characteristics are easier to measure accurately than the other features. The median “residual.sugar” content is 5.2 \(g/dm^3\) and the median “alcohol” content is 10.40 vol.% (the mean is 10.51 vol.%).
A histogram of wine qualities can show if there is a skew in the rating of wines and if they tend to be underrated or overrated.
The quality of white wines appears to follow a roughly normal distribution. The scale is from 0 - 10, but the lowest score given was a 3 (20 wines) and the highest was a 9 (5 wines). Are there common profiles for the worst and best wines?
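The shape of the rating distribution can be checked directly from the per-level counts printed in the summary table. The sketch below uses only those counts; the variable names are chosen here for illustration.

```r
# Counts per quality level, copied from the summary output above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
levels_num <- as.integer(names(counts))

n_total <- sum(counts)                                 # number of wines
mean_quality <- sum(levels_num * counts) / n_total     # weighted mean rating
median_quality <- median(rep(levels_num, counts))      # expand counts, take median
share_5_or_6 <- sum(counts[c("5", "6")]) / n_total     # share rated 5 or 6

# Bar chart of the counts, one bar per quality level.
barplot(counts, xlab = "quality", ylab = "number of wines")
```

The totals reproduce the 4898 wines, the median of 6, and the mean of about 5.878 reported in the summary, and roughly three quarters of the wines fall in levels 5 and 6.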
A set of base histograms is created for all attributes.
The plots above show the distribution of all of the features. The base histograms show that “fixed.acidity”, “citric.acid”, “total.sulfur.dioxide”, “density”, “pH”, and “sulphates” are normally distributed while “volatile.acidity”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, and “alcohol” have skewed distributions. However, binwidths and axes need adjustment in order to find any unexpected distributions. Histograms of features that provide significantly more information than the base histograms are presented below.
There appears to be a ~0.3 \(g/dm^3\) peak and a ~0.5 \(g/dm^3\) spike. It would be interesting to know which wines have a citric acid content ~0.5 \(g/dm^3\).
There appears to be a bimodal distribution for residual sugar content. There are probably wines for people who prefer drier wines and for others who prefer sweeter wines. It would be interesting to know the properties of these two subsets.
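How a second mode can stay hidden until the bins and axis are adjusted can be sketched on made-up data. The sample below is synthetic (two log-normal groups chosen for illustration), and the use of a log10 scale is an assumption, since the original plotting code is not shown.

```r
set.seed(6)
# Synthetic stand-in for a right-skewed, sugar-like variable made of two groups.
sugar <- c(rlnorm(400, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(600, meanlog = log(9), sdlog = 0.4))

# Default histogram on the raw scale: the long right tail dominates.
hist(sugar, main = "raw scale", xlab = "sugar")

# More bins on a log10 scale make the two modes easier to tell apart.
log_sugar <- log10(sugar)
h <- hist(log_sugar, breaks = 40, main = "log10 scale", xlab = "log10(sugar)")
```

Note that `breaks = 40` is only a suggestion to `hist()`, which picks "pretty" bin edges close to that number.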
There is a long tail of higher chloride concentrations for the lower quality wines. Is this possibly an important feature to distinguish wine quality?
The lower quality wines tend to have lower free sulfur dioxide concentrations. Perhaps wines with more free sulfur dioxide (less oxidized) taste better?
The higher quality wines tend to have more alcohol content. Could it be that alcohol makes wine taste better?
A set of box plots is created for all features. The data is limited to the middle 90%. The mean for each category is plotted as an “x”. These box plots should help show the trend of each feature with wine “quality” and the range of a feature at each “quality” level.
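A box plot of this kind can be sketched in base R as follows. The data are synthetic, and trimming the rows to the 5th to 95th percentile is an assumption about what "the middle 90%" means here (the original may instead have restricted only the plotted axis).

```r
set.seed(1)
# Synthetic stand-in data (not the real wines): alcohol loosely rises with quality.
quality <- sample(3:9, 500, replace = TRUE)
alcohol <- 8 + 0.4 * quality + rnorm(500)
wines <- data.frame(quality = factor(quality), alcohol = alcohol)

# Keep only the middle 90% of the feature (5th to 95th percentile).
limits <- quantile(wines$alcohol, c(0.05, 0.95))
middle <- wines[wines$alcohol >= limits[1] & wines$alcohol <= limits[2], ]

# Box plot per quality level, with the group means overlaid as "x" (pch = 4).
boxplot(alcohol ~ quality, data = middle, xlab = "quality", ylab = "alcohol")
group_means <- tapply(middle$alcohol, middle$quality, mean)
points(seq_along(group_means), group_means, pch = 4)
```

Trimming the rows changes the medians and means slightly, whereas limiting only the axis would leave them untouched, so the choice matters when the two statistics are compared.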
The features that appear to vary with wine quality are “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, and “alcohol”. It seems that the variation within a feature is more obvious through this set of box plots than through the previous set of histograms. The top features to affect “quality” appear to be “density” and “alcohol”. The medians and means of “fixed.acidity” and “sulphates” hardly vary with “quality”. Thus, these features will not be explored further.
Scatterplots and correlation calculations of the characteristics of wine that might be more closely associated to quality should help ascertain which features might be important to quality. Additionally, how the individual features correlate with each other will be investigated. If certain features are strongly related, a feature may be removed to avoid redundancy in a model. Which features are strongly related is also explored.
The “alcohol” feature has a strong correlation with wine quality. The “density” feature has a moderately strong correlation with the quality of wine. “residual.sugar” presents a bimodal distribution, and would not necessarily have a strong linear correlation with wine quality. It will be explored further since “density”, and “alcohol” have strong correlations with “residual.sugar”. In fact, the correlation between “density” and “residual.sugar” is the strongest in this set. Similarly, “total.sulfur.dioxide” has a strong correlation with “density”. “alcohol” also has a strong correlation with “chlorides” and “total.sulfur.dioxide”. The relation of these variables with each other and quality will be analyzed more closely in the next section.
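The kind of correlation table discussed here can be sketched with base R's `cor()`. The data frame below is synthetic, with dependencies between the columns built in purely for illustration, so its numbers are not the correlations of the real wines.

```r
set.seed(2)
n <- 300
# Synthetic stand-in data with a built-in dependence between the columns.
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
residual.sugar <- pmax(0.6, 20 - 1.3 * alcohol + rnorm(n, sd = 3))
density <- 1.005 - 0.0012 * alcohol + 0.0004 * residual.sugar +
  rnorm(n, sd = 0.0005)
quality <- round(pmin(9, pmax(3, 0.5 * alcohol + rnorm(n, sd = 0.8))))
wines <- data.frame(alcohol, residual.sugar, density, quality)

# Pearson correlation matrix of all numeric columns, rounded for reading.
cor_matrix <- round(cor(wines), 2)
print(cor_matrix)

# Correlation of each feature with quality, sorted by absolute size.
cor_quality <- cor_matrix[, "quality"]
cor_quality <- cor_quality[names(cor_quality) != "quality"]
print(cor_quality[order(abs(cor_quality), decreasing = TRUE)])
```

Because “quality” only takes a few integer values, `cor(wines, method = "spearman")` is a reasonable rank-based cross-check of the Pearson values.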
There is a strong negative linear relationship between “density” and “alcohol”. As “alcohol” increases, “density” decreases. The higher quality wines tend to cluster in the lower right region, which corresponds to higher “alcohol” content and lower “density”. It would be interesting to see if the ratio of the two is a better feature.
Since “density” and “alcohol” are highly correlated with “residual.sugar”, the same plot is created (colored by “residual.sugar” content instead of “score”) in order to better understand the relationship between these features. There appears to be a clear separation between wines with higher “residual.sugar” and lower “residual.sugar”. The higher “residual.sugar” wines tend to have a higher density:alcohol ratio.
The “residual.sugar” decreases with increasing “alcohol”. However, there isn’t a strong linear relationship between “residual.sugar” and “alcohol” for the whole range. An exponential decay in “residual.sugar” appears to exist with increasing “alcohol”. The higher “quality” wines tend to have a higher “alcohol” and lower “residual.sugar” content. If there is a strong relationship between “residual.sugar” and “alcohol”, perhaps only the “alcohol” feature needs to be kept for future analysis.
There is a negative linear relationship between “total.sulfur.dioxide” and “alcohol”. As “alcohol” increases, “total.sulfur.dioxide” decreases. However, the relationship between the two and “quality” is not clear. For now, both features may be important to keep.
There is a negative linear relationship between “chlorides” and “alcohol”. As “alcohol” increases, “chlorides” decreases. However, the relationship between the two and “quality” is not clear. For now, both features may be important to keep.
The “residual.sugar” increases with increasing “density”. The relationship between “residual.sugar” and “density” appears to be more of an exponential growth. The higher “quality” wines tend to have a higher “residual.sugar”:“density” ratio. This ratio will be explored in a later section.
There is a linear relationship between “total.sulfur.dioxide” and “density”. As “total.sulfur.dioxide” increases, “density” increases. This plot also shows that higher “quality” wines tend to have lower “density” values. However, the relationship between the two and “quality” is not clear. For now, both features may be important to keep.
The previous scatterplots showed that the “density”:“residual.sugar” and “density”:“alcohol” may be important transformed features. Here, the correlation of the ratios with quality is examined.
The correlation between “density”:“residual.sugar” and “quality” is low, showing that it is not an important feature for quality. The correlation between “density”:“alcohol” and “quality” is strongly negative. However, the positive correlation between “quality” and “alcohol” is stronger in magnitude. As a result, neither of these ratios will be analyzed further.
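One likely reason the “density”:“alcohol” ratio adds little is that “density” varies by only a few percent (0.9871 to 1.0390 in the summary table), so the ratio behaves almost like 1 / “alcohol”. The sketch below illustrates this on synthetic data whose ranges loosely follow the summary table; the generating formulas are assumptions made for illustration.

```r
set.seed(3)
n <- 300
# Synthetic stand-in data; ranges loosely follow the summary table above.
alcohol <- runif(n, min = 8, max = 14.2)
density <- 1.005 - 0.0012 * alcohol + rnorm(n, sd = 0.001)
quality <- round(pmin(9, pmax(3, 0.5 * alcohol + rnorm(n, sd = 0.8))))

# Candidate transformed feature: density divided by alcohol.
ratio <- density / alcohol

cor_ratio_quality <- cor(ratio, quality)      # negative, like the ratio in the text
cor_alcohol_quality <- cor(alcohol, quality)  # positive
# density barely varies, so the ratio is almost a rescaled 1 / alcohol
cor_ratio_inverse <- cor(ratio, 1 / alcohol)

print(c(ratio = cor_ratio_quality, alcohol = cor_alcohol_quality,
        ratio_vs_inverse = cor_ratio_inverse))
```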
Now that the secondary attributes are investigated, the medians of each feature by quality level are investigated. Using the medians can help show overall trends in the data, which can be lost in scatterplots of the full data set. This section is used to confirm interpretations of previous analyses.
Medians of “chlorides”, “total.sulfur.dioxide”, “density”, “pH”, and “alcohol” appear to have strong correlations with “quality”. While a linear model may not be appropriate for wine “quality”, this plot shows that there are strong trends between these features and “quality”. It was surprising to see such a change in the correlations of “chlorides”, “total.sulfur.dioxide”, and “pH” with “quality” when the medians were used. Without these results, the previous analysis may have ruled them out.
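The jump in correlation when medians are used is expected to some degree: a correlation of per-level medians is computed on only seven points and ignores the spread within each level. A minimal sketch on synthetic data (the weak built-in trend in the stand-in `chlorides` column is an assumption for illustration) shows the effect.

```r
set.seed(4)
n <- 2000
quality <- sample(3:9, n, replace = TRUE)
# Synthetic stand-in data: a weak trend buried in much larger row-level noise.
wines <- data.frame(
  quality = quality,
  alcohol = 8 + 0.4 * quality + rnorm(n, sd = 1.5),
  chlorides = 0.06 - 0.003 * quality + rnorm(n, sd = 0.02)
)

# Median of each feature within each quality level (one row per level).
medians <- aggregate(cbind(alcohol, chlorides) ~ quality,
                     data = wines, FUN = median)

# Correlation on the raw rows versus on the per-level medians.
cor_raw <- cor(wines$chlorides, wines$quality)
cor_medians <- cor(medians$chlorides, medians$quality)
print(c(raw = cor_raw, medians = cor_medians))
```

The median-based value is useful for spotting a trend, but it should not be read as the strength of the relationship for individual wines.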
In a previous histogram, there was a secondary “citric.acid” peak at 0.49 \(g/dm^3\). Here the wines with that property are analyzed in order to see if they are different from the average wine and if the “citric.acid” feature needs to be reconsidered.
The secondary peak is at 0.49 \(g/dm^3\), which is much greater than the average concentration found at every quality level. Citric acid supposedly adds ‘freshness’ and flavor to wines. There are 215 wines with this citric acid content, and the majority of them are mediocre wines with average sugar content, average alcohol content, and average density (compared to global averages). Thus, the “citric.acid” feature remains largely irrelevant for wine “quality”.
For this part of the analysis, the wines are separated into two sets: wines with a “residual.sugar” content of 4 \(g/dm^3\) or less, and wines with more than 4 \(g/dm^3\) “residual.sugar”. A previous histogram showed a bimodal distribution of “residual.sugar” content. This analysis is done to see if the wines in the data set should be separated by this feature.
## volatile.acidity citric.acid residual.sugar
## 0.26782785 0.32875060 1.82293753
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 0.04411016 29.90295660 120.12398665
## density pH alcohol
## 0.99182044 3.21308059 11.00906056
The low “residual.sugar” subset has slightly higher correlation magnitudes between “quality” and most other attributes, on average. However, the magnitudes of the correlations among the other attributes are generally lower in this subset than in the full data set. The “residual.sugar” correlations with other features are, surprisingly, much lower in this subset. The average attributes of this subset are mostly similar to the average attributes of the full data set. The exceptions are lower “residual.sugar”, “free.sulfur.dioxide”, and “total.sulfur.dioxide” values. It might be useful to separate the wines into a low “residual.sugar” subset since some statistics are different.
## volatile.acidity citric.acid residual.sugar
## 0.28603713 0.33826491 9.81165655
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 0.04701678 39.35469475 152.01374509
## density pH alcohol
## 0.99567963 3.16968940 10.14383434
The high “residual.sugar” subset does not appear to be significantly different from the full data set (other than in “residual.sugar”) even though it contains only about 57% of the wines. This subset does have slightly higher means in “chlorides” and “free.sulfur.dioxide”.
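The 57% figure can be cross-checked from the printed means alone, because the overall mean is a weighted average of the two subset means. The sketch below uses only numbers copied from the outputs above; the variable names are chosen here for illustration.

```r
# Mean residual.sugar values copied from the outputs above.
mean_all <- 6.391         # full data set (summary table)
mean_low <- 1.82293753    # subset with residual.sugar <= 4
mean_high <- 9.81165655   # subset with residual.sugar > 4

# mean_all = p * mean_high + (1 - p) * mean_low, solved for the share p.
share_high <- (mean_all - mean_low) / (mean_high - mean_low)
print(round(share_high, 2))

# Consistency check: the same share should reproduce the overall alcohol mean.
alcohol_check <- share_high * 10.14383434 + (1 - share_high) * 11.00906056
print(round(alcohol_check, 2))
```

The implied share is about 0.57, and the same weights recover the overall “alcohol” mean of about 10.51, so the printed subset means are consistent with the summary table.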
This section explores wines with a rating of 4 or less in order to see their common characteristics.
## volatile.acidity citric.acid residual.sugar
## 0.37598361 0.30770492 4.82103825
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 0.05055738 26.63387978 130.23224044
## density pH alcohol
## 0.99434306 3.18338798 10.17349727
The worst wines have higher “volatile.acidity” and “chlorides” and lower “citric.acid”, “residual.sugar”, and “free.sulfur.dioxide” than the average of all wines. Many of these properties were not viewed as main features to predict wine “quality”.
This section explores wines with a rating of 8 or better in order to see their common characteristics.
## volatile.acidity citric.acid residual.sugar
## 0.27797222 0.32816667 5.62833333
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 0.03801111 36.62777778 125.88333333
## density pH alcohol
## 0.99221439 3.22116667 11.65111111
The best wines have lower “chlorides” and more “alcohol”, which are features that were seen to be important in predicting “quality”.
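To put the worst and best profiles side by side, the printed means can be expressed as percentage differences from the overall means. The sketch below uses only values copied from the outputs above for four of the features; the data frame layout and column names are chosen here for illustration.

```r
# Means copied from the outputs above (overall values from the summary table).
means <- data.frame(
  feature = c("volatile.acidity", "chlorides", "free.sulfur.dioxide", "alcohol"),
  overall = c(0.2782, 0.04577, 35.31, 10.51),
  worst   = c(0.37598361, 0.05055738, 26.63387978, 10.17349727),  # quality <= 4
  best    = c(0.27797222, 0.03801111, 36.62777778, 11.65111111)   # quality >= 8
)

# Relative difference from the overall mean, in percent.
means$worst_pct <- round(100 * (means$worst / means$overall - 1), 1)
means$best_pct  <- round(100 * (means$best / means$overall - 1), 1)
print(means[c("feature", "worst_pct", "best_pct")])
```

On these numbers the worst wines stand out most in “volatile.acidity” (roughly a third above the overall mean) and “free.sulfur.dioxide”, while the best wines differ mainly in “chlorides” and “alcohol”.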
A linear model does not seem appropriate for predicting the quality of a wine since “quality” is a categorical variable. In fact, Cortez et al. (2009) use a support vector machine (SVM) to predict the quality of wine. The ultimate goal of this analysis is to highlight the main features of Vinho Verde white wine quality.
The features that contribute most to Vinho Verde white wine “quality” are “alcohol” and “density”. “residual.sugar”, “chlorides”, “total.sulfur.dioxide”, and “pH” may be the next most important features. These differ from some of the features predicted from the Vinho Verde description and the feature descriptions (“citric.acid” and “volatile.acidity”). The predicted features that did appear to be important were “density” and “total.sulfur.dioxide”. Correlations between the features were examined and interesting subsets were further analyzed.
Alcohol content appears to influence wine “quality” positively, if the wine is at least mediocre (quality level of 5 and greater). Generally, as “alcohol” content increases, wine quality also increases. The plot shows the range of the “alcohol” content for each “quality” level and how both the median and means of “alcohol” content increase with “quality”. There is quite a spread of the data for “alcohol” at each “quality” level, which signifies that it cannot be the only feature to predict wine “quality”.
“density” and “alcohol” have the largest correlations with “quality”, but there is an underlying relationship between “density”, “alcohol”, and “residual.sugar”. The top plot shows how “density” decreases with increasing “alcohol” content. Additionally, there is a clear separation between higher and lower “residual.sugar” content. While “residual.sugar” and “density”:“residual.sugar” did not highly correlate with “quality”, this plot shows that there is a separation in “quality”. Since these three features are highly correlated, perhaps not all three features should be included in predicting wine “quality”. The calculated correlations are:
0.84 for “residual.sugar” & “density”,
-0.78 for “alcohol” & “density”, and
-0.45 for “alcohol” & “residual.sugar”.
This last plot shows that there are two subsets of sugar content: wines for people who prefer sweeter wines and wines for people who prefer drier wines. The analysis showed that, other than in sugar content, the wines with higher “residual.sugar” were not significantly different from the average of all wines. However, the wines with lower “residual.sugar” had lower “free.sulfur.dioxide” and “total.sulfur.dioxide” values, on average.
The most difficult part of the analysis was recognizing that wine “quality” could not be treated as a continuous variable, like diamond price. At first, I was trying to find which feature transformations would lead to the highest correlation with “quality”. Since there are only seven “quality” categories, correlations do not tell the whole story. After this realization, the analysis became easier and I was able to ask interesting questions. I think finding the interesting subsets (citric acid peak, and residual sugar subsets) will be useful for future analysis and applying a machine learning algorithm.